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Abstract: A fast direct inversion scheme for the large sparse systems of 
linear equations resulting from the discretization of elliptic partial differ- 
ed I ential equations in two dimensions is given. The scheme is described for 

■ the particular case of a discretization on a uniform square grid, but can be 
, generalized to more general geometries. For a grid containing N points, the 
I scheme requires 0{N log^ N) arithmetic operations and 0{N log N) storage 

^ ■ to compute an approximate inverse. If only a single solve is required, then 

I the scheme requires only 0{^/N logN) storage; the same storage is suf- 

0^ ' ficient for computing the Dirichlet-to-Neumann operator as well as other 

^ . boundary-to-boundary operators. The scheme is illustrated with several 

I— 1 1 numerical examples. For instance, a matrix of size 10^ x 10^ is inverted to 

' seven digits accuracy in four minutes on a 2.8GHz P4 desktop PC. 

^: 

^ . 1. Introduction 

» . 

2 ' This note describes a scheme for rapidly solving the systems of linear equations 

^H^. arising from the finite element or finite difference discretization of elliptic partial 

I differential equations in two dimensions. Unlike most existing fast schemes which 

rely on iterative solvers (GMRES / conjugate gradient / . . . ), the method described 
here directly inverts the matrix of the linear system. This obviates the need for 
I customized pre-conditioners, and leads to dramatic speed-ups in environments in 

■ which the same equation needs to be solved for a sequence of different right-hand 
', sides. 

' The scheme is described for the case of equations defined on a uniform square 

. grid. Slight modifications would enable the handling of uniform grids on fairly 

I regular two-dimensional domains, but further work would be required to construct 

' methods for non-uniform grids or complicated geometries. 

• 1— I . For a system matrix of size NxN (corresponding to a ^/N x \/]V grid), the scheme 

^ I requires 0{N log^ N) arithmetic operations, and 0{N log A^) storage to compute the 

■ full inverse. For a single solve, only 0{\/lV logN) storage is required. Moreover, 
for problems loaded on the boundary only, any solves beyond the first require only 
0{^/N log A^) arithmetic operations (provided that only the solution on the bound- 
ary is sought). Numerical experiments indicate that the constants in these asymp- 
totic estimates are quite moderate. For instance, to directly solve a system involving 
a 1 000 000 X 1 000 000 matrix to seven digits of accuracy takes about four minutes 
on a 2.8GHz desktop PC with 512Mb of memory. Additional solves beyond the first 
can be performed in 0.03 seconds (provided only boundary data is involved). 

The scheme is inspired by earlier work by Hackbusch and co-workers [?, ?, ?, 
?] and work by Gu and Chandrasekaran [?, ?, ?, ?]. These authors have pub- 
lished methods that rapidly perform algebraic operations (matrix-vector multiplies, 
matrix-matrix multiplies, matrix inversion, etc) on matrices whose off-diagonal 
blocks have low rank. Our scheme is similar, but uses simpler data structures. 
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The fast inversion scheme retains its 0(A^log^ A^) computational complexity for 
a wide range of network matrices. It does not rely on the fact that the matrix is 
associated with a PDE; rather, it works for any matrix whose inverse has off-diagonal 
blocks of low rank (what Hackbusch calls an "H-matrix, and Gu and Chandrasekaran 
calls an "HSS-matrix"). 

The paper is structured as follows: Section [2] describes some known algorithms for 
performing fast operations on matrices with off-diagonal blocks of low rank. Section 
[3] describes a very simple 0{N'^) inversion scheme. Section [4] describes how the 
0{N'^) scheme of Section [3] can be accelerated to 0(A^log^ N) using the methods of 
Section [21 Section [6] give the results of numerical experiments. 

2. Preliminaries 

In this section, we summarily describe a class of matrices for which there are 
known algorithms for rapidly evaluating the result of algebraic operations such as 
matrix- vector products, matrix-matrix products, and matrix inversions. The basic 
concept is that the off-diagonal blocks of the matrix can, to within some preset 
accuracy e, be approximated by low-rank matrices. Different authors have given 
different definitions, and provided different algorithms; we mention that Hackbusch 
refers to such matrices as W-matrices [?], or "H^-matrices [?], while Gu and Chan- 
drasekaran refers to them as "Hierarchically Semi-Separable" (HSS) matrices [?] or 
"Sequentially Semi-Separable" (SSS) matrices [?]. 

In this paper we will use the term HSS matrix to refer to an n x n matrix B that 
can be tessellated in the pattern shown in Figure 12.11 in such a way that the rank 
of every block in the tessellation is bounded by some fixed small number p. Such a 
matrix can clearly be stored in 0{pnlogn) storage, and can be applied to a vector 
in 0{pnlogn) arithmetic operations. Moreover, via a trivial recursion (described 
in AppendixE]), it is possible to invert such a matrix in O(pnlog^n) operations. 

Remark 2.1. We will in this paper not distinguish between a matrix that is exactly 
an HSS-matrix, and a matrix that can to high precision be approximated by an HSS- 
matrix. 

Remark 2.2. We use the term SSS-matrix to refer to a matrix that is an HSS- 
matrix but has the additional property the bases for the row and column spaces 
of the off-diagonal blocks can be expressed hierarchically. As an example, a basis 
for the column space of the block labeled (4, 5) is constructed from the bases for 
the column spaces of the blocks (8,9) and (9,8). An SSS-matrix requires only 
0{N) storage, and can be applied to a vector in 0{N) operations. Matrix-inversion 
schemes for SSS-matrices are a big more complicated than HSS inversion schemes, 
but in some environments, 0{N) inversion schemes exist, see e.g. [?]. 

Remark 2.3. In Hackbusch's terminology, what we call an HSS-matrix roughly 
corresponds to an "H-matrix, and what we call an SSS-matrix, roughly corresponds 
to an "H^-matrix. The tessellations used by Hackbusch are slightly different, though. 
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Figure 2.1. Tessellation of an HSS matrix. 
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Figure 3.1. The computational grid for n = 6 
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3. An exact 0{N'^) inversion scheme 

Letting n be a positive even integer, we consider a static conduction problem on 
a square (n + 2) x (n + 2) grid such as the one illustrated in Figure [3TT1 Each interior 
node is connected to its four nearest neighbors with bars with positive conductivity. 
We prescribe the temperatures on the boundary nodes (marked with filled circles in 
Figure IXTI) . and seek the equilibrium temperatures of the N internal nodes, where 
N = n?. This problem can be formulated as a linear system 

(3.1) Ax = b, 

where A is an x sparse matrix, x is an A x 1 vector of unknown temperatures, 
and 6 is an A X 1 vector derived from the pre-scribed boundary temperatures. When 
all bars have unit conductivity, the matrix A in (j3.ip is the well-known five-point 
stencil with coefficients 4/— 1/— 1/— 1/— 1. 

In this section we describe a method for directly solving the linear system ()3.ip 
that relies on the sparsity structure of the matrix only. In the absence of rounding 
errors, it would be exact. When the matrix A is of size A x A, the scheme of this 
section requires 0{N'^) floating point operations and 0{N) memory. This makes 
the scheme significantly slower than well-known 0{N^/'^) schemes such as nested 
dissection. (We mention that 0(A^/^) is optimal in this environment.) The only 
purpose of the scheme presented in this section is that it can straight-forwardly be 
accelerated to an 0(A) or 0{N\og^ N) scheme, as shown in Section |H 

Ordering the A points in the grid in the spiral pattern shown in Figure 13.11 the 
matrix A in equation ()3.ip has the sparsity pattern shown in Figure [321 We next 
partition the grid into n concentric squares, and collect the nodes into index sets 
Ji, J2, ... , Jn accordingly. In other words, 

Ji = {1,2,3,4}, 
J2 = {5, 6, ... ,16}, 



•Jn 



{(2n- 1)2 + 1, (2n-l)2+2, 



(2n)2}. 



For K, A G {1, 2, . . . n}, we let A^^x denote the submatrix of A formed by the inter- 
section of the rows with the J\ columns. Then the linear system (13. ip takes on 
the block-tridiagonal form (also shown in Figure [32]) 



(3.2) 



where x and b have been split accordingly. 

The equation (13. 2p can now easily be solved by eliminating the variables one by 
one. Using the first equation to eliminate xi from the second one, we obtain the 
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Figure 3.2. The sparsity pattern of A in ^7i\i for = 100. 
following system of equations for the variables X2, . . . , x„: 



A22 


A23 


•• 





A32 


A33 


A34 ■■ 








A43 


A44 ■■ 











•• 


A 



(3.3) 



where ^422 = A22 - A21A., A12 and 62 
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62 — A2iA^^bi. The process used to 
eliminate xi can easily be continued to eliminate the first n — 1 blocks. This leaves 
us with the (8n — 4) x (8n — 4) system 

Ann Xn — ^n; 

which we solve directly to obtain Xn- Once Xn is known, we compute by solving 
the system 

^n— 1,71— 1 Xn—1 — bn—l ^ra— l,n 

The remaining Xj's are computed analogously. The entire process is summed up in 
Algorithm I. 

We note that while all matrices Af^\ are sparse, the matrices j4kk are dense. This 
means that the cost of inverting A^^^ in each step of Algorithm I is O(k^). (Note 
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Algorithm 1: 


(1) 


All = All and bi = bi. 


(2) 


tor K = 2 : n 




A —A —A 1 A^^ A 1 


(4) 




(5) 


end 


(6) 




(7) 


for K = {n-1) : (-1) : 1 


(8) 


= (bn - x^+i) 


(9) 


end 



Figure 3.3. This algorithm directly solves the tridiagonal system 
of equations (|3.2p . 



that the remaining matrix-matrix operations involve matrices that are diagonal or 
tri-diagonal and have negligible costs in comparison to the matrix inversion.) The 
total cost T^^^lf" is therefore 

n 

rpdense \ ^ ,.3 ^4 t\t2 
^total ~ 2^ ~ ~ ^ • 

K = l 

Remark 3.1. We note that the matrices are not merely artificial objects of the 
numerical algorithm. In fact, A~^ is the matrix that maps the load on the boundary 
to the potential on the boundary. In many environments, the interior of the grid 
is unloaded {i.e. b^ = when k = 1, 2, . . . , n — 1). In such cases, the operator 
A~^ is the solution operator of the original problem. Since none of the intermediate 
matrices are required in this environment, the algorithm only requires 0{N) 
memory. 

4. A FAST DIRECT SOLVER 

The computational cost of Algorithm I consists almost entirely of the inversion 
of the dense matrices j4kk- This cost can be greatly reduced whenever is a 
"compressible" matrix in the sense described in Section [2j It is not the purpose 
of this paper to classify exactly when this happens; we will simply note that A 
is typically compressible in this sense when it results from the discretization of 
an elliptic PDE, while it is typically not compressible when it results from the 
discretization of a highly oscillatory wave equation. We also note that the condition 
that A result from the discretization of an elliptic PDE is very far from necessary, 
as the numerical examples in Section [6] demonstrate. 
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In cases where each ^4^^ is what we in Section [2] called in "HSS" -matrix, very 
simple inversion schemes (see Appendix]^ are available for computing an approxi- 
mation to in 0{k log^ k) arithmetic operations. The total computational cost 
of Algorithm I is then 

n 

^lotfl ~ E (log ~ (log ~ ^ (log ^)'- 

K = l 

This scheme requires 0{N log A^) memory to store the entire approximation to A~^. 
Such a scheme has been implemented and the computational results are given in 
Section [H 

We note that the scheme is particularly memory efficient in environments where 
the domain is loaded on the boundary only, and only the solution at the boundary is 
sought, cf. Remark [3 .11 In such situations, the operators need not be stored, and 
the inversion algorithm only requires 0(\/iV log A) memory. Moreover, if equation 
(j3.ip is to be solved for several different right hand sides, subsequent solutions are 
obtained almost for free via application of the pre-computed operator A~^. 

Finally we note that while it would require 0(A) memory to store A itself, the 
scheme only accesses each entry of A once. This means that these elements can 
either be computed on the fly (if given by a formula) , or read sequentially from slow 
memory ("tape"). 

Remark 4.1. In cases where each Schur complement A,^^ is not only compressible 
in the "HSS" -sense, but also in the "SSS" -sense, the cost of approximately inverting 
Akk can be reduced from O(k^) to 0(k), see Section [2l The total cost of Algorithm 
I is then 

n 

K = l 

This is typically the case when A results from the discretization of an elliptic partial 
differential equation. 

5. Generalizations 

The scheme presented here can in principle be adapted to more general grids 
in two and three dimensions. The generalization to other difference operators on 
uniform square grids is trivial. Other two-dimensional grids that are uniform in the 
sense that they can easily be partitioned into a sequence of concentric annuli can 
also quite easily be handled, and we expect the performance of such schemes to be 
similar to the performance reported in Section [6l 

For grids arising from adaptive mesh-refinement, or involve more complex geome- 
tries, it is still possible to construct 0(Alog'^ A) inversion schemes; but they will 
very likely require algorithms involving a broader palette of operations on compress- 
ible matrices such as matrix-matrix multiplications. What is interesting about the 
specific method given here is that it requires only matrix-inversions and diagonal 
updates. 
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6. Numerical examples 

The 0(A^log^ A^) numerical scheme described in Sections [3] and H] has been imple- 
mented and tested on a conduction problem on a square uniform grid. We assigned 
each bar in the grid a conductivity drawn from a uniform random distribution on 
the interval [1, 2]. For a range of grid sizes between 50 x 50 and 1000 x 1000, we 
computed the operator described in Section O The computational cost, the 
amount of memory required, and the accuracies obtained are presented in Table EH 
The following quantities are reported: 

^invert Time required to construct An (in seconds) 

Tappiy Time required to apply (in seconds) 

M Memory required to construct A^ (in kilobytes) 

ei The largest error in any entry of A~^ 

62 The error in P-operator norm of A~^ 

63 The /^-error in the vector A~^ r where r is a unit vector of random direction. 

64 The /^-error in the first column of A~^. 

We estimated ci and 62 by comparing the result from the fast algorithm with the 
result from a brute force calculation of the Schur complement. We estimated 63 and 
64 by solving equation ()3.ip using iterative methods. 
Some technical notes: 

• The experiments were run on a 2.8GHz Pentium 4 PC with 512Mb of RAM. 

• Off-diagonal blocks in the HSS-representations were represented to the fixed 
accuracy e = 10"''. 

• The ranks used in the off-diagonal blocks was allowed to vary from block to 
block (it was determined adaptively). 

• The code used is written in a Matlab-FORTRAN hybrid. It is not at all 
optimized. Significant gains in efficiency should be obtainable by choosing 
block sizes in more intelligently than we did. 

The scaling of Tinvort , ^appiy , and M with is displayed in Figure 16.21 These 
tables appear to support our claims regarding the performance of the scheme. The 
values of ei, 62, 63, and 64 for different values of are shown in Figure [631 The 
table appears to indicate that errors grow as the square root of (in other words, 
linearly with the number of steps performed in Algorithm 1). This means that we 
could easily keep the error constant by very moderately increase £ as the problem 
size gets larger. 

Figure [6741 gives the time tkappa (in seconds) required to perform step k of the fast 
version of Algorithm 1. The large jumps in the curve correspond to repartitioning 
of the HSS-matrices. (The jaggedness of the curve gives an indication of how poorly 
optimized the code is.) 
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N 


^invert 


Tapply 


M 


ei 


62 


63 


64 


10000 


5.93e-l 


2.82e-3 


3.82e+2 


1.29e-8 


1.37e-7 


2.61e-8 


3.31e-8 


40000 


4.69e+0 


6.25e-3 


9.19e+2 


9.35e-9 


8.74e-8 


4.71e-8 


6.47e-8 


90000 


1.28e+l 


1.27e-2 


1.51e+3 






7.98e-8 


1.25e-7 


160000 


2.87e+l 


1.38e-2 


2.15e+3 






9.02e-8 


1.84e-7 


250000 


4.67e+l 


1.52e-2 


2.80e+3 






1.02e-7 


1.14e-7 


360000 


7.50e+l 


2.62e-2 


3.55e+3 






1.37e-7 


1.57e-7 


490000 


1.13e+2 


2.78e-2 


4.22e+3 










640000 


1.54e+2 


2.92e-2 


5.45e+3 










810000 


1.98e+2 


3.09e-2 


5.86e+3 










1000000 


2.45e+2 


3.25e-2 


6.66e+3 











Figure 6.1. Table summarizing the computational experiment de- 
scribed in Section [6l 
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Figure 6.2. Plots of (a) Tinvert/^^ versus iV, (b) Ts,pp\y/VN versus 
A^, (c) M/VN versus N. 

Appendix A. An O(nlog^n) inversion scheme for HSS-matrices 

A recursive fast inversion scheme for HSS matrices follows immediately from the 
following formula for the inverse of a 2 x 2 block matrix: 

" ^11 Ai2 ^ 

{An - A12 ^22^ A2iy^ -{An - ^12 ^21)"^ ^12 A^^ 

_ -A^^ A21 {An - A12 A^^ A2iy^ A:^^ + A^^ A21 {An - Au A^^ A21) Au A^. 

From the formula, we immediately get Algorithm 2, displayed in Figure lA.ll To 
see that this algorithm has complexity O(nlog^n), we note that the matrices A12 
and A21 all have low rank. This means that the matrix-matrix multiplications that 
occur on lines (7) and (9) in fact consist simply of a small number of multiplications 
between HSS-matrices and vectors. Moreover, the matrix additions in lines (7) and 
(9) are in fact low-rank updates to HSS-matrices. 




Figure 6.4. Plot of versus k. 
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Algorithm 2: 



(1) function B = mvert_HSS_matrix{A) 



(2) if {A is "small") then 

(3) B = 

(4) else 

(5) Split ^=[;1^^ 

(6) X22 = invert_HSS_matrix{A22) 

(7) Fii = All - A12 X22 A21 

(8) Xii = invert_HSS_matrix{Yii) 

(Q\ Q — "'^ii — -'^ii ^12 X22 

—X22 A21 Xii X22 + X22 A21 Xii A12 X22 

(10) end if 



(11) end function 



Figure A.l. Algorithm for inverting HSS matrices. Note that An, 
A22, Xii, X22, and Yii are all HSS-matrices, and that A12 and A21 
are low-rank matrices. 



